Exploratory Analysis of Wines
Posted on Mar 15 janvier 2019 in Data Analysis
Wine Review Analysis: Worldwide & France¶
Data:
Source of Data Dimension: 10 columns and 150k rows of wine reviews.
Context:
The data was scraped from WineEnthusiast on November 22nd, 2017 by zackthoutt. WineEnthusiast collect wine reviews from a team of experienced wine taster, the dataset represent a set of wine reviews with ratings and description of the wine tasted.
Goal:
Exploratory Analysis of wines:
- Distributions.
- Influence of wine ratings over prices.
- Most expensive variety.
- Best variety rating.
Country: The country that the wine is from description
designation: The vineyard within the winery where the grapes that made the wine are from
points: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
price: The cost for a bottle of the wine
province: The province or state that the wine is from
region_1: The wine growing area in a province or state (ie Napa)
region_2: Sometimes there are more specific regions specified within a wine growing area (ie Rutherford inside the Napa Valley), but this value can sometimes be blank
variety: The type of grapes used to make the wine (ie Pinot Noir)
winery: The winery that made the wine
Table of Contents:¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import missingno as msno
import matplotlib as mpl
%matplotlib inline
import warnings
warnings.filterwarnings(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
# Customize Matplotlib
mpl.rcParams['font.size'] = 14
mpl.rcParams["figure.figsize"] = (10,8)
df = pd.read_csv('../input/winemag-data_first150k.csv',index_col=0)
df.head()
1) Overall Exploration¶
Visualize nullity by column¶
msno.bar(df)
Country Distribution¶
sns.countplot(y="country", data=df, order=df.country.value_counts().iloc[:10].index)
plt.show()
Variety Distribution¶
sns.countplot(y="variety", data=df, order=df.variety.value_counts().iloc[:10].index)
plt.show()
TOP3 Variety by Country¶
groupby_variety = df.groupby(["country", "variety"])["variety"].count()
top_variety = groupby_variety.groupby(level=0, group_keys=False)
top3_variety = top_variety.nlargest(3)
df_top3_variety = top3_variety.loc[["US", "Italy", "France", "Spain","Chile"],:].to_frame()
df_top3_variety.columns = ["Frequency"]
df_top3_variety.index.names = ['Countries','Variety']
df_top3_variety.reset_index(inplace=True)
sns.catplot(x="Countries", y="Frequency", hue="Variety", data=df_top3_variety,
height=7, kind="bar", palette="muted")
Price Distribution¶
percentage_NaN_price = round(df.price.isnull().sum() / df.shape[0], 2)
print("Percentage of NaN Price: {}".format(percentage_NaN_price))
df.price.fillna(0, inplace=True)
ax = sns.distplot(np.log10(df.price + 1))
plt.title("Price Log distribution")
plt.show()
The majority of wine prices are between 10 and 100$(logarithm base 10).
Points Distribution¶
df.points.describe()
ax = sns.distplot(df.points)
plt.show()
WineEnthusiast only post wines with a range >= 80. The following scale rating is used:
- 95-100 Classic: a great wine
- 90-94 Outstanding: a wine of superior character and style
- 85-89 Very good: a wine with special qualities
- 80-84 Good: a solid, well-made wine
Wine reviews are based on blind tastings. Different editors cover the same region year to year. This rating is not a famous one, like Robert Parker or Revue du Vin de France.
50% of wine reviews are between 86-90 points. As we could imagine there is very small number of great wines >= 95.
points_percentage = df.points.value_counts(normalize=True) * 100
great_wines_percentage = round(points_percentage.loc[95:100].sum(), 2)
print("Proportion of great Wines>= 95: {}".format(great_wines_percentage))
Influence of Wine Ratings over Price¶
ax = sns.boxplot(x="points", y="price", data=df)
plt.title("Wine Reviews vs Price")
plt.show()
subset_ratings_step5 = df.loc[df.points.isin([80, 85, 90, 95, 100]), ["price","points"]]
ax = sns.boxplot(x="points", y="price", data=subset_ratings_step5)
plt.ylim([0,1000])
plt.show()
subset_ratings_step2 = df.loc[df.points.isin([80, 82, 84, 86, 88, 90]), ["price","points"]]
ax = sns.boxplot(x="points", y="price", data=subset_ratings_step2)
plt.ylim([0,200])
plt.show()
We don't see a clear influence of low ratings <= 88 over prices, whereas high ratings influence clearly price.
Let's do an ANOVA Test to find relationships between ratings and prices.
ANOVA will find if there is a real difference between price means and ratings by testing if there is a significant difference between ratings.
#ANOVA
model = smf.ols(formula='points ~ price', data=df)
results = model.fit()
print("The ANOVA Test find a significant difference between these 2 variables. \n"
"F-statistic: {} F-pvalue: {}".format(results.fvalue, results.f_pvalue))
Most expensive variety¶
median_top20_variety = df.loc[:, ["variety", "price"]] \
.groupby("variety") \
.median() \
.sort_values(by="price", ascending= False) \
.index[:20] \
.values
plt.figure(figsize=(10,8))
ax = sns.boxplot(x="price", y="variety",
data=df.loc[df["variety"].isin(median_top20_variety),:],
order=median_top20_variety
)
ax = sns.swarmplot(x="price", y="variety",
data=df.loc[df["variety"].isin(median_top20_variety),:],
color=".25")
plt.title("Price Distribution by Variety")
plt.xlim([0,800])
plt.show()
I notice that there is some mistakes in the fied Variety. Some wine reviewers has labeled variety with:
- A mix of grap variety like Cabernet Shiraz
- Name of vineyards. A variety should be a variety of graps and that's it.
Very few wine reviews for some variety, high dispersion. we can't really say with precision which variety is the most expensive
Wineries with the biggest number of unique wines TOP20¶
top20_wineries = df.winery.value_counts(normalize=True).nlargest(20) \
.sort_values(ascending= False).index
plt.figure(figsize=(12,8))
sns.countplot(y="winery", data=df[df.winery.isin(top20_wineries)],
order=top20_wineries, hue="country")
plt.show()
US and France wineries get a lot of different wines. Cameron Hughes winery is the only with wine produced all over the world.
2) Zoom on the vineyard of France¶
french_df = df.loc[df["country"] == "France",:]
french_df.shape
Which variety is the most expensive ?¶
median_price_variety_fr = french_df.loc[:, ["variety", "price"]] \
.groupby("variety") \
.median() \
.sort_values(by="price", ascending= False)
median_price_top10_variety_fr = median_price_variety_fr.index[:10]
plt.figure(figsize=(10,8))
ax = sns.boxplot(x="price", y="variety",
data=french_df.loc[french_df["variety"].isin(median_price_top10_variety_fr),:],
order=median_price_top10_variety_fr
)
plt.title("Price Distribution by Variety")
plt.xlim([0,800])
plt.show()
- Carignan-Syrah and Muscadelle has too few observations to say it's really an expensive variety.
- So Marsanne is the most expensive variety considering it's Median and it's InterQuartile Range. It's a white wine variety, not a famous variety, that's surprising !
- Champagne worldwide famous, associated to the French luxury, is not the most expensive as I thought. There is always an opportunity to drink champagne, as Lily Bollinger said - “I only drink champagne when I’m happy, and when I’m sad. Sometimes I drink it when I’m alone. When I have company, I consider it obligatory. I trifle with it if I am not hungry and drink it when I am. Otherwise I never touch it—unless I’m thirsty.”
Which variety has the best ratings for the most frequent variety ?¶
top10_variety_fr = french_df.variety.value_counts(normalize=True).nlargest(10) \
.sort_values(ascending= False).index
median_ratings_top10_variety_fr = french_df.loc[french_df["variety"]
.isin(top10_variety_fr), ["variety", "points"]] \
.groupby("variety") \
.median() \
.sort_values(by="points", ascending= False) \
.index
plt.figure(figsize=(10,8))
ax = sns.boxplot(x="points", y="variety",
data=french_df.loc[french_df["variety"].isin(top10_variety_fr),:],
order=median_ratings_top10_variety_fr
)
plt.title("Ratings by Variety")
plt.show()
Champagne, White and Read Bordeaux get the higghest ratings: 90.
Champagne ratings tends to be higher than others variety, the whisker is more shifted to the right.
Wineries with the biggest number of unique wines¶
top20_wineries_fr = french_df.winery.value_counts(normalize=True).nlargest(20) \
.sort_values(ascending= False).index
plt.figure(figsize=(12,8))
sns.countplot(y="winery", data=french_df[french_df.winery.isin(top20_wineries_fr)],
order=top20_wineries_fr, hue="province")
plt.show()
Burgundy (Bourgogne in French) is over represented, it came from the particularity of "climat" from this region.
A climat is a wine terroir, a small parcel of land with it's own particularities: climatic and geologic. There is more than 1000 different climat in Burgundy.